607 Final Project

Brian Weinfeld

May 5, 2018

Introduction

What is the relationship between the top-grossing blockbusters of the last decade and the gender of each movie’s main stars?

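library(tidyverse)  # purrr, dplyr, stringr, tidyr, readr
library(RCurl)      # getURL
library(XML)        # htmlParse, xpathSApply, xmlValue

# Scrape the yearly Box Office Mojo chart and keep the top `rank` titles per year
Top.Movie.Query <- function(years, rank){
  years %>%
    map_df(~getURL(paste0('http://www.boxofficemojo.com/yearly/chart/?yr=', .x, '&p=.htm')) %>% 
              htmlParse() %>%
              xpathSApply('//*[@id="body"]/table[3]//tr//td', xmlValue) %>%
              .[15:914] %>%                  # 900 cells = 100 chart rows x 9 columns
              matrix(ncol=9, byrow=T) %>%    # one chart row per matrix row
              as.data.frame() %>%
              filter(row_number() <= rank) %>%
              # strip a trailing " (year)" suffix from the title in column V2
              mutate(Movie=str_replace(V2, paste0('(.*?)( \\(', .x, '\\))$'), '\\1'),
                     Year = .x) %>%
              select(Movie, Year)
    )
}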

top.movies <- Top.Movie.Query(2017:2008, 50)

all.movies <- map2_df(top.movies$Movie, top.movies$Year, ~Movie.API.Query(.x, .y))

The first function, Top.Movie.Query, scrapes boxofficemojo.com for the names of the 50 top-grossing domestic releases of each year from 2008 to 2017.
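The reshaping and title-cleaning steps can be tried offline. The following is a minimal sketch, assuming a made-up vector of table cells (nine per chart row) in place of the scraped page; the cell values and the title "Example Film (2017)" are hypothetical.

```r
library(stringr)

# Hypothetical flat vector of cell values, nine per chart row, standing in
# for what xpathSApply() returns from the scraped page.
cells <- c(
  "1", "Star Wars: The Last Jedi", "BV", "a", "b", "c", "d", "e", "f",
  "2", "Example Film (2017)",      "WB", "a", "b", "c", "d", "e", "f"
)
year <- 2017

# Fill the matrix row by row so that each chart row becomes one data frame row.
rows <- as.data.frame(matrix(cells, ncol = 9, byrow = TRUE),
                      stringsAsFactors = FALSE)

# Same pattern as in Top.Movie.Query: drop a trailing " (year)" if present.
pattern <- paste0("(.*?)( \\(", year, "\\))$")
movies <- data.frame(
  Movie = str_replace(rows$V2, pattern, "\\1"),
  Year  = year,
  stringsAsFactors = FALSE
)
print(movies)
```

Titles without the year suffix pass through unchanged, because str_replace returns the input when the pattern does not match.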

library(httr)      # GET, add_headers, content
library(jsonlite)  # fromJSON

# Look up one movie on OMDb; `apikey` must be defined before calling
Movie.API.Query <- function(movie, year){
  print(movie)
  initial.query <- GET('http://www.omdbapi.com/', 
      add_headers('Content-Type'='application/json', 'Accept-Encoding'='gzip'),
      query=list('t'=movie, 'apikey'=apikey, 'y'=year, 'plot'='full')
  ) %>%
  content(as='text') %>%
  fromJSON(flatten=FALSE) %>%
  .[-15] %>%      # drop element 15 (presumably the nested Ratings list)
  as_tibble()
  # a two-column result is treated as "not found"
  if(ncol(initial.query) == 2){
    print('Movie Not Found!')
    tibble(Title=movie, Year=as.character(year))
  }else{
    initial.query %>%
      select(c(1, 2, 3, 5, 6, 9, 10, 14, 15, 18, 21)) %>%
      mutate(Genre = str_extract(Genre, '([^,]+)'),     # first listed genre only
             Runtime = str_extract(Runtime, '(\\d+)'),  # digits of the runtime
             Actors = IMDB.Star.Query(imdbID),          # replace with the two leads from IMDb
             BoxOffice = parse_number(BoxOffice)
      ) %>%
      separate(Actors, c('Lead_1', 'Lead_2'), sep=', ') %>%
      mutate(Lead_1_Male = Wikipedia.Gender.Query(Lead_1),
             Lead_2_Male = Wikipedia.Gender.Query(Lead_2)
      ) %>%
      select(c(1:6, 13, 7, 14), everything())
  }
}

The second function, Movie.API.Query, requests each movie from the OMDb API. It calls two other functions to fill in missing information, namely the stars of the movie and the genders of those stars.
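The field clean-up inside Movie.API.Query can be checked without calling the API. The following is a minimal sketch, assuming hypothetical raw values in the shape the function appears to expect; the `fields` list is made up for illustration and is not an actual API response.

```r
library(stringr)
library(readr)

# Hypothetical raw field values, made up for illustration.
fields <- list(
  Genre     = "Action, Crime, Drama",
  Runtime   = "152 min",
  BoxOffice = "$533,316,061"
)

# Keep only the first genre: everything up to the first comma.
genre <- str_extract(fields$Genre, "([^,]+)")

# Keep only the digits of the runtime (still a character string).
runtime <- str_extract(fields$Runtime, "(\\d+)")

# Strip the currency symbol and thousands separators, returning a number.
box_office <- parse_number(fields$BoxOffice)

print(list(genre = genre, runtime = runtime, box_office = box_office))
```

Note that Runtime stays a character string after str_extract; converting it with as.integer would be a separate step.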

# Scrape the first two credited actors from a title's IMDb page
IMDB.Star.Query <- function(movie.id){
  Sys.sleep(1)  # pause between requests
  getURL(paste0('https://www.imdb.com/title/', movie.id,'/')) %>% 
    htmlParse() %>%
    xpathSApply('//*[@id="title-overview-widget"]//span[@itemprop="actors"]//a', xmlValue) %>%
    .[1:2] %>%
    paste(collapse=', ')
}

# Guess a star's gender from the first paragraphs of their Wikipedia article:
# TRUE = male, FALSE = female, NA = undetermined
Wikipedia.Gender.Query <- function(lead){
  Sys.sleep(0.5)  # pause between requests
  lead <- str_replace_all(lead, ' ', '_')
  initial.query <- getURL(paste0('https://en.wikipedia.org/wiki/', lead)) %>% 
            htmlParse() %>%
            xpathSApply('//*[@id="mw-content-text"]/div/p[position()<3]', xmlValue) %>%
            unlist()  %>%
            paste(collapse='')
  # on a disambiguation page, retry with the '_(actor)' suffix
  if(str_detect(initial.query, 'may refer to:')){
    initial.query <- getURL(paste0('https://en.wikipedia.org/wiki/', lead, '_(actor)')) %>% 
      htmlParse() %>%
      xpathSApply('//*[@id="mw-content-text"]/div/p[position()<3]', xmlValue) %>%
      unlist() %>%
      paste(collapse='')
  }
  # decide only when exactly one of the two words appears
  if(str_detect(initial.query, 'actor') && !str_detect(initial.query, 'actress')){
    return(TRUE)
  }else if(str_detect(initial.query, 'actress') && !str_detect(initial.query, 'actor')){
    return(FALSE)
  }else{
    return(NA)
  }
}

IMDB.Star.Query uses the IMDb ID returned by the API to scrape the two main leads of each film from its IMDb page. Wikipedia.Gender.Query scrapes Wikipedia in an effort to determine the gender of each star by looking for the words ‘actor’ or ‘actress’ in the opening paragraphs of the star’s article. If a determination cannot be made, it returns NA. This method had a success rate of over 95%.
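The decision rule itself can be separated from the scraping and tested on plain strings. The following is a minimal sketch; the helper name `classify_intro` and the sample sentences are hypothetical and are not part of the original code.

```r
library(stringr)

# Hypothetical helper: applies the same 'actor' / 'actress' rule as
# Wikipedia.Gender.Query to a piece of text that has already been fetched.
classify_intro <- function(intro) {
  has_actor   <- str_detect(intro, "actor")
  has_actress <- str_detect(intro, "actress")
  if (has_actor && !has_actress) {
    TRUE        # only 'actor' appears: treated as male
  } else if (has_actress && !has_actor) {
    FALSE       # only 'actress' appears: treated as female
  } else {
    NA          # both or neither appear: undetermined
  }
}

print(classify_intro("He is an English actor."))
print(classify_intro("She is an American actress and producer."))
print(classify_intro("An actor and the child of an actress."))
print(classify_intro("A comedian and television host."))
```

The match is case-sensitive, and 'actor' is not a substring of 'actress', so the two checks do not interfere with each other. Text that mentions both words, or neither, falls through to NA.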

| Title | Year | Rated | Runtime | Genre | Lead_1 | Lead_1_Male | Lead_2 | Lead_2_Male | BoxOffice | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| The Dark Knight | 2008 | PG-13 | 152 | Action | Christian Bale | TRUE | Heath Ledger | TRUE | 533316061 | Male/Male |
| Avatar | 2009 | PG-13 | 162 | Action | Sam Worthington | TRUE | Zoe Saldana | FALSE | 749700000 | Male/Female |
| Marvel’s The Avengers | 2012 | PG-13 | 143 | Action | Robert Downey Jr. | TRUE | Chris Evans | TRUE | 623357910 | Male/Male |
| The Dark Knight Rises | 2012 | PG-13 | 164 | Action | Christian Bale | TRUE | Tom Hardy | TRUE | 448130642 | Male/Male |
| Star Wars: The Force Awakens | 2015 | PG-13 | 136 | Action | Daisy Ridley | FALSE | John Boyega | TRUE | 936658640 | Female/Male |
| Jurassic World | 2015 | PG-13 | 124 | Action | Chris Pratt | TRUE | Bryce Dallas Howard | FALSE | 528757749 | Male/Female |
| Rogue One: A Star Wars Story | 2016 | PG-13 | 133 | Action | Felicity Jones | FALSE | Diego Luna | TRUE | 532171696 | Female/Male |
| Finding Dory | 2016 | PG | 97 | Animation | Ellen DeGeneres | FALSE | Albert Brooks | TRUE | 486292984 | Female/Male |
| Star Wars: The Last Jedi | 2017 | PG-13 | 152 | Action | Daisy Ridley | FALSE | John Boyega | TRUE | 619117636 | Female/Male |
| Beauty and the Beast | 2017 | PG | 129 | Family | Emma Watson | FALSE | Dan Stevens | TRUE | 503974884 | Female/Male |

Initial Analysis

| Type | n |
|---|---|
| Female/Female | 32 |
| Female/Male | 79 |
| Male/Female | 169 |
| Male/Male | 220 |
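The counts above can be turned into shares of the full sample. The following is a minimal sketch that uses only the four counts from the table; the variable names are illustrative.

```r
# Counts of movies per lead pairing, copied from the table above.
counts <- c(
  "Female/Female" = 32,
  "Female/Male"   = 79,
  "Male/Female"   = 169,
  "Male/Male"     = 220
)

# Total number of movies in the sample (10 years x 50 movies).
total <- sum(counts)

# Share of each pairing, in percent, rounded to one decimal place.
shares <- round(100 * counts / total, 1)

# Fraction of movies whose first-billed lead is male.
male_first <- sum(counts[c("Male/Female", "Male/Male")]) / total

print(total)
print(shares)
print(male_first)
```

Under these counts, the first-billed lead is male in 389 of the 500 movies.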

| Title | Year | BoxOffice | Type |
|---|---|---|---|
| Frozen | 2013 | $400,736,600 | Female/Female |
| Maleficent | 2014 | $190,871,149 | Female/Female |
| Cinderella | 2015 | $183,327,144 | Female/Female |
| The Help | 2011 | $169,705,587 | Female/Female |
| Hidden Figures | 2016 | $169,385,416 | Female/Female |

| Title | Year | BoxOffice | Type |
|---|---|---|---|
| Star Wars: The Force Awakens | 2015 | $936,658,640 | Female/Male |
| Star Wars: The Last Jedi | 2017 | $619,117,636 | Female/Male |
| Rogue One: A Star Wars Story | 2016 | $532,171,696 | Female/Male |
| Beauty and the Beast | 2017 | $503,974,884 | Female/Male |
| Finding Dory | 2016 | $486,292,984 | Female/Male |

| Title | Year | BoxOffice | Type |
|---|---|---|---|
| Avatar | 2009 | $749,700,000 | Male/Female |
| Jurassic World | 2015 | $528,757,749 | Male/Female |
| Transformers: Revenge of the Fallen | 2009 | $402,076,689 | Male/Female |
| Jumanji: Welcome to the Jungle | 2017 | $393,201,353 | Male/Female |
| Guardians of the Galaxy Vol. 2 | 2017 | $389,804,217 | Male/Female |

| Title | Year | BoxOffice | Type |
|---|---|---|---|
| Marvel’s The Avengers | 2012 | $623,357,910 | Male/Male |
| The Dark Knight | 2008 | $533,316,061 | Male/Male |
| The Dark Knight Rises | 2012 | $448,130,642 | Male/Male |
| Avengers: Age of Ultron | 2015 | $429,113,729 | Male/Male |
| Toy Story 3 | 2010 | $414,984,497 | Male/Male |

A rating breakdown could also be added here if needed.

Sentiment Analysis

Suggestions

| Type | word | n | tf | idf | tf_idf |
|---|---|---|---|---|---|
| Male/Male | frustration | 1 | 0.0003729 | 1.3862944 | 0.0005169 |
| Male/Male | mystery | 3 | 0.0011186 | 0.2876821 | 0.0003218 |
| Male/Male | prolong | 1 | 0.0003729 | 1.3862944 | 0.0005169 |
| Male/Male | colonel | 4 | 0.0014914 | 0.6931472 | 0.0010338 |
| Male/Male | major | 3 | 0.0011186 | 0.6931472 | 0.0007753 |
| Male/Male | drone | 1 | 0.0003729 | 1.3862944 | 0.0005169 |
| Male/Male | survive | 11 | 0.0041014 | 0.6931472 | 0.0028429 |
| Male/Male | specialist | 2 | 0.0007457 | 1.3862944 | 0.0010338 |
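The idf and tf_idf columns can be reproduced by hand. The following is a minimal sketch, assuming that each of the four pairing types is treated as one document and that idf is the natural log of (number of documents / number of documents containing the word). The `types_containing` counts are inferred from the idf values in the table, not taken from the original analysis.

```r
# Assumption: four "documents", one per lead pairing type.
n_types <- 4

# Inferred from the idf column: how many types contain each word.
types_containing <- c(frustration = 1, mystery = 3, colonel = 2, survive = 2)

# Inverse document frequency with the natural logarithm.
idf <- log(n_types / types_containing)

# Term frequencies copied from the table above.
tf <- c(frustration = 0.0003729, mystery = 0.0011186,
        colonel = 0.0014914, survive = 0.0041014)

# tf-idf is the product of the two.
tf_idf <- tf * idf

print(round(idf, 7))
print(round(tf_idf, 7))
```

A word that appears in only one of the four types gets the largest idf, log(4), which is why rare words such as "frustration" and "drone" score relatively high despite a single occurrence.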

Amy Adams and Cameron Diaz star in “Action Blockbuster”! Frustrated by her commanding officer’s unwillingness to address an ongoing civil war on a foreign island nation, Major Jennifer Slater (Cameron Diaz) enlists the help of survival specialist Annie (Amy Adams). Together they journey to the secretive island in an effort to end the prolonged conflict. But what they discover there will shake the world to its very core. Can they solve the mystery of the island before Jennifer’s renegade colonel can nuke the island via drone? You won’t want to miss a moment of “Action Blockbuster”!
